One strategy for machine-aided indexing (MAI) is to 
provide a concept-level analysis of the textual 
elements of documents or document abstracts. In 
such systems natural-language phrases are 
analyzed in order to identify and classify concepts 
related to a particular subject domain. The overall 
performance of these MAI systems is largely 
dependent on the quality and comprehensiveness 
of their knowledge bases. These knowledge bases 
function to (1) define the relations between a 
controlled indexing vocabulary and natural language 
expressions; (2) provide a simple mechanism for 
disambiguation and the determination of relevancy; 
and (3) allow the extension of a concept-hierarchical 
structure to all elements of the file. After a brief 
description of the NASA Machine-Aided Indexing 
system, concerns related to the development and 
maintenance of MAI knowledge bases are discussed. 
Particular emphasis is given to statistically-based 
text analysis tools designed to aid the knowledge 
base developer. One such tool, the Knowledge Base 
Building (KBB) program, presents the domain expert 
with a well-filtered list of synonyms and 
conceptually-related phrases for each thesaurus 
concept. Another tool, the Knowledge Base 
Maintenance (KBM) program, identifies areas 
of the knowledge base affected by changes in the 
conceptual domain, for example, the addition of a 
new thesaurus term. An alternate use of the KBM as 
an aid in thesaurus construction is also discussed. 


1. Introduction 

The primary goal of natural language processing (NLP) 
is to establish a machine system that can effectively determine 
the conceptual content of written text and manipulate those 
concepts in order to provide a response which mimics some 
human intellectual activity. Although this goal has not been 
achieved, many of the analysis methods developed in support 
of natural language research have been incorporated into 
operational systems that function as computer aids rather than 
as fully automatic techniques. In the area of machine-aided 
indexing (MAI), various strategies for statistical and syntactic 
analysis and knowledge base design have been used. The 
function of such machine-aided indexing systems is to provide 
a concept-level analysis of the textual elements of documents 
or document abstracts — the final output being a list of candi- 
date index terms from an established classification scheme or 
thesaurus. 

This paper will present an outline of the functional 
elements comprising knowledge-based MAI systems along 
with a description of a system currently in operation at the 
NASA Center for AeroSpace Information (CASI). Secondly, 
the development and maintenance of MAI knowledge bases 
are discussed with particular emphasis being given to statisti- 
cally-based text analysis tools designed to aid the knowledge 
base developer. Finally, the application of these tools as an aid 
in thesaurus construction is described. 


2. Machine-Aided Indexing through Text Analysis 
Functional Elements 

As implied above, the types of MAI systems of concern 
here are those which function through the analysis of text. 
There are, of course, other strategies for providing assistance 
to the indexer. The MedIndEx Project¹ at the National Library 
of Medicine, for example, is currently developing an 
interactive expert system based on the “rules of indexing”, where the 
indexer is guided through the indexing process in a somewhat 
heuristic manner. 

The operations that comprise text-based MAI systems 
(Fig. 1) can be generalized as the following: 

- delineation of text phrases 

- identification/reduction of semantic units 

- semantic analysis 

The primary task of the first of these operations is to 
establish boundaries or parameters that will assure that two or 
more non-adjacent words, selected for subsequent semantic 
interpretation, actually represent a grammatically ‘correct’ 
association (such as between a modifying adjective and the 
noun it was intended to modify). This can be carried out by 
using techniques that vary in complexity from simple 
phrase-breaking procedures to formal syntactic parsing. Simple 
non-syntactic techniques have the advantage of being 
time-efficient; full parsing, on the other hand, may provide 
greater accuracy. 

The second operation involves the extraction of multi-word 
strings (and single words) which may express concepts 
within a given subject domain. This operation typically 
requires a means for the successive concatenation of words 
within phrases and may optionally include a process for 
word-stemming (as in PHOTOELECTRICAL to PHOTOELECTRIC) 
or phrase normalization (as in SURFACE OF THE MOON to 
MOON SURFACE). 

In the third operation, the final forms of the words and 
word-strings identified are then translated into appropriate 
indexing terms from the controlled vocabulary. One of the 
primary functions of the knowledge base is to serve as the 
equivalency table for this translation process. 
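The second and third operations can be illustrated with a minimal sketch. It assumes a toy suffix rule for stemming, a single "X OF THE Y" normalization pattern, and a two-entry equivalency table; the table contents and all function names are hypothetical and are not taken from any operational system.

```python
# Hypothetical equivalency table: normalized text phrase -> thesaurus term.
EQUIVALENCY_TABLE = {
    "MOON SURFACE": "LUNAR SURFACE",
    "PHOTOELECTRIC": "PHOTOELECTRICITY",
}

ARTICLES = {"THE", "A", "AN"}


def stem_word(word):
    """Toy stemmer: reduce an '-ICAL' ending to '-IC' and nothing else."""
    return word[:-2] if word.endswith("ICAL") else word


def normalize_phrase(phrase):
    """Rewrite 'X OF (THE) Y' as 'Y X', then stem each word."""
    words = phrase.upper().split()
    if "OF" in words:
        i = words.index("OF")
        head = words[:i]
        modifier = [w for w in words[i + 1:] if w not in ARTICLES]
        if head and modifier:
            words = modifier + head
    return " ".join(stem_word(w) for w in words)


def translate(phrase):
    """Look up the normalized phrase; None means no indexing term applies."""
    return EQUIVALENCY_TABLE.get(normalize_phrase(phrase))
```

With these assumptions, both `translate("surface of the moon")` and `translate("moon surface")` reach the same table entry, which is why normalization can reduce the number of entries the knowledge base must hold.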



Figure 1. MAI Systems Operations 


Knowledge Base Design and Function 

The core of the MAI knowledge base is the thesaurus or 
classification scheme used for indexing. These controlled 
vocabularies represent the concepts of interest within a par- 
ticular subject domain. The MAI knowledge base can be 
viewed as a conceptual network that (1) defines the relations 
between controlled thesaurus terms and natural language 
expressions, and (2) allows the extension of thesaurus hierar- 
chical structure to all elements of the file. 

Besides containing entries that map controlled terms to 
textual expressions, the knowledge base contains entries that 
represent decisions regarding the relevancy of particular 
concepts. For example, within an aeronautics domain, the 
concept AIRCRAFT is much too broad in meaning to be a 
relevant indexing term for most instances where the word 
aircraft appears in text. In this case, specific entries in the 
knowledge base would trigger a search for a larger semantic 
unit (such as aircraft stability, A-320 aircraft, aircraft 
construction materials, etc.). Also, file entries are included that 
serve to disambiguate certain words; for example, whether the 
word matrices refers to mathematical matrices or material 
matrices. 
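One way to picture these relevancy and disambiguation entries is the sketch below. It assumes a simple entry record with an "extend" flag meaning that the word is too broad or too ambiguous to index alone. The `Entry` class, the `KB` contents, and the term names are illustrative assumptions, not the actual NASA file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    term: Optional[str] = None  # thesaurus term to output, if any
    extend: bool = False        # True: look for a larger semantic unit instead


# Hypothetical knowledge base entries.
KB = {
    "AIRCRAFT": Entry(extend=True),              # too broad to index alone
    "AIRCRAFT STABILITY": Entry(term="AIRCRAFT STABILITY"),
    "A-320 AIRCRAFT": Entry(term="A-320 AIRCRAFT"),
    "MATRICES": Entry(extend=True),              # ambiguous without context
    "STIFFNESS MATRICES": Entry(term="MATRICES (MATHEMATICS)"),
    "METAL MATRICES": Entry(term="METAL MATRIX COMPOSITES"),
}


def index_terms(words):
    """Return thesaurus terms for one delineated phrase (upper-case words)."""
    terms = []
    for i, word in enumerate(words):
        entry = KB.get(word)
        if entry is None:
            continue
        if entry.term is not None:
            terms.append(entry.term)
        if entry.extend:
            # Try the two-word units formed with each adjacent word.
            for j in (i - 1, i + 1):
                if 0 <= j < len(words):
                    pair = " ".join(words[min(i, j):max(i, j) + 1])
                    larger = KB.get(pair)
                    if (larger is not None and larger.term is not None
                            and larger.term not in terms):
                        terms.append(larger.term)
    return terms
```

Under these assumptions the word AIRCRAFT alone yields no index term, while `index_terms(["AIRCRAFT", "STABILITY"])` yields the more specific concept.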

Naturally, the form that any particular knowledge base 
takes on is dependent on how the other system operations are 
carried out. The procedures selected for initial phrase 
delineation and analysis define what kinds of information need to be 
represented in knowledge base entries and also how large an 
operational file will need to be (e.g., the use of word-stemming 
and phrase normalization can reduce the number of required 
entries). Likewise, the strategies used for disambiguation and 
relevancy analysis define the level of complexity required for 
knowledge representation and ultimately may dictate what 
kind of data structure is utilized. 


NASA MAI System 

The NASA Machine-Aided Indexing Project was initiated 
several years ago and had as its goal the development of 
two operational systems: one to carry out “subject switching”² 
(i.e., the translation of terms from one controlled vocabulary 
to the terms of another); and an MAI system based on the 
analysis of natural language text (Fig. 2). The designs for both 
of these were based on a “phrase structure rewrite” method, the 
historical development of which is described by Klingbiel 
(1985). Very simply, a phrase structure rewrite system, or 
“lexical dictionary”, is a table format and access procedure 
that provides an efficient means for the translation of single 
and multi-word phrases to a controlled vocabulary. 


SUBJECT SWITCHING 
The Translation of Terms from One Controlled Vocabulary to Another 
(DTIC → NASA) 
(DOE → NASA) 

NATURAL LANGUAGE MAI 
Conceptual Analysis of Title and Abstract Text 

Figure 2. NASA MAI Project 

Although the subject switching system was used in an 
operational setting, the application of the text system was 
limited due to its unacceptably long response time when used 
in an interactive workstation environment. An additional 
problem was the slow development time and level of manual 
effort associated with knowledge base construction. In 1987, 
a re-design effort was initiated that focused on the evaluation 
of the phrase delineation process. The delineation process was 
based on the syntactic analysis of input text and thus required 
(1) that the syntactic class of each word be identified from a 
separate table, and (2) that the sequence of resulting syntactic 
classes be checked against a table of grammar rules. 



Figure 3. Design Comparison 

In the new design the syntactic procedure was replaced by 
a simple preprocessing step and the incorporation of a 
proximity limit for words concatenated during semantic-unit 
identification. Preprocessing consists of breaking raw text input at 
end-punctuation (periods, colons, semicolons) and at the 
occurrence of certain ‘stop-words.’ The proximity limit is a 
constraint imposed on the word-concatenation process (that 
process functioning to identify semantic units for subsequent 
look-up in the knowledge base). It is an empirically 
established value above which the likelihood of grammatically 
‘incorrect’ word associations becomes significant (thereby 
resulting in regular errors in the final MAI output). 
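The preprocessing step and the proximity limit can be sketched as follows. The stop-word list and the limit value are placeholders, and reading the proximity limit as the maximum distance between two concatenated words is an assumption; the paper does not give the operational values or the exact rule.

```python
import re

# Assumed stop-word list and limit; the operational values are not given.
STOP_WORDS = {"IS", "ARE", "WAS", "WERE", "WHICH", "THAT"}
PROXIMITY_LIMIT = 2


def delineate(text):
    """Break raw text at end-punctuation and at stop-words into word runs."""
    segments = []
    for chunk in re.split(r"[.:;]", text.upper()):
        run = []
        for word in re.findall(r"[A-Z0-9-]+", chunk):
            if word in STOP_WORDS:
                if run:
                    segments.append(run)
                run = []
            else:
                run.append(word)
        if run:
            segments.append(run)
    return segments


def candidate_units(segment, limit=PROXIMITY_LIMIT):
    """Two-word candidates whose words lie within `limit` positions."""
    units = []
    for i in range(len(segment)):
        for j in range(i + 1, min(i + limit, len(segment) - 1) + 1):
            units.append(segment[i] + " " + segment[j])
    return units
```

Raising the limit admits more non-adjacent pairs such as METAL COMPOSITES, which is the trade-off the empirically established value is meant to control.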

This and other revisions of NASA MAI resulted in a 
system that was able to match and in some cases improve upon 
the output of the original system. In addition, response time 
was reduced by approximately 80 percent. 

The NASA system represents a good example of MAI 
process integration. In the same way that the phrase delinea- 
tion process was incorporated into the process for semantic- 
unit identification, the concatenation method used for this 
second process is integrated with the final MAI process — 
semantic-unit translation (i.e., the knowledge base look-up). 
The method for the identification of semantic-units is carried 
out through the execution of concatenation logic rules that 
dynamically incorporate information from knowledge base 
entries. Thus, in many instances, the existence of specific 
knowledge base entries directs the concatenation process. 

3. Knowledge Base Development 
Development Tools 

As was mentioned earlier, the particular form and content 
of an MAI knowledge base is dependent on the design for 
carrying out the basic system operations. To a large extent, a 
system can compensate for design tradeoffs by incorporating 
the appropriate class of entries into the knowledge base. The 
NASA system, for example, does not include a mechanism for 
word-stemming; thus, all variant forms of words are included 
in knowledge base entries. One of the primary concerns of the 
system designer, then, becomes the tradeoff between the size 
of the knowledge base file and the response time associated 
with a particular level of system complexity. 

Regardless of the specific design selected for an MAI 
system, the overall performance of these systems is largely 
dependent on the quality and comprehensiveness of their 
knowledge bases. Strict control and input from domain experts 
are critical during the development process. In fact, it may be 
the construction of the knowledge base itself which requires 
the greatest amount of development time and resources. The 
bulk of knowledge base content is represented by the entries 
mapping natural language expressions to thesaurus concepts. 
The identification of those expressions may seem like an 
infinite task, especially if the thesaurus representing the 
subject domain is very large. KWIC (Key Word In Context) 
indexes of available text are of little use precisely because they 
are arrangements by ‘key words’ rather than the actual target 
concepts. Obviously, a review of abstract text on a case-by-case 
basis is grossly inefficient and is likewise untargeted with 
regard to domain concepts. In addition, both of these strategies 
lead to an unnecessarily large knowledge base due to the 
addition of expressions that are essentially ‘unique,’ i.e., text 
expressions that have a very low frequency of occurrence. A 
statistically-based text analysis tool was designed, again in 
support of the NASA project, that presents the domain expert 
with a well-filtered list of synonymous and conceptually-related 
phrases for each thesaurus concept. This tool was 
designed to satisfy three main requirements (Fig. 4): 

• The output phrases would be targeted to one specific 
concept (i.e., the phrases considered during any particular 
session would be related to a single thesaurus term; thus, 
all expressions related to a particular domain concept 
could be analyzed together) 

• The output phrases would be restricted to those that 
had a high frequency of occurrence within the existing 
NASA database (thus, ‘unique’ expressions would be 
screened out) 

• The phrases would be in a file-compatible form (i.e., the 
phrases would be normalized to a form which could be 
extracted by the semantic-unit identification process) 


Targeted, Meaningful Text:     Candidate Synonyms for Each Thesaurus Term 
File Compatible Phrases:       Phrase Format as Recognized by Semantic ID Process 
High Frequency of Occurrence:  Unique Expressions Screened Out 


Figure 4. Requirements for Text Analysis 

The basic processing steps of the Knowledge Base 
Building Tool, or KBB, can be described as follows: 

1) The text used for input consists of the titles and 
abstracts from a large (150-2000) set of bibliographic records 
related to a single thesaurus concept; standard on-line search 
capabilities are used to identify an accurate set of records. 

2) The text is copied into a file and preprocessed using 
a simple text-breaking method similar to that used in the MAI 
process. 

3) A word-concatenation process is then used to iden- 
tify all possible multi-word phrases within a maximum length 
(five). A proximity-limit for concatenation is imposed along 
with certain rules that provide syntactic filtering (which, for 
example, prevent prepositions and articles from beginning or 
ending a phrase). 

4) A count of the frequency-of-occurrence is determined 
for each unique single-word and multi-word phrase. 
The words and phrases are then sorted in descending order by 
the frequency values. A lower-limit value is established, under 
which all associated phrases are eliminated. However, there 
is a natural bias for single words to have much higher 
frequencies than two-word phrases, which, in turn, will have 
higher frequencies than three-word phrases, etc. This can be 
dealt with in two ways. One is to simply produce five separate 
sorts, each one corresponding to a different phrase length. The 
other is to utilize a derived frequency value that effectively 
accounts for the bias. A process for determining such a value 
was recently described by Jones, Gassie, and Radhakrishnan 
(1990). The formula can be stated as W × F × N², where W is the 
sum of the frequencies of the words in the phrase, F equals the 
frequency of the phrase, and N equals the number of distinct 
words in a phrase. The N² factor was an empirically established 
relation found to be optimal for their particular application. 

5) The final processing procedure serves to further 
refine the output. The phrases are checked against the existing 
knowledge base to eliminate any phrase that properly trans- 
lates to a thesaurus concept other than the one that the KBB is 
currently analyzing; and to eliminate single words and phrases 
that have a poor or low semantic value. 
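Steps 3 and 4 can be sketched as below, reading the derived frequency value as the product of W, F, and N squared (an interpretation of the formula as printed). The function-word list, the lower limit, and the use of strictly contiguous phrases are simplifying assumptions; the KBB itself applies a proximity limit whose value is not stated here.

```python
from collections import Counter

# Assumed list of words that may not begin or end a phrase.
FUNCTION_WORDS = {"OF", "THE", "A", "AN", "IN", "ON", "FOR", "WITH", "BY", "TO"}
MAX_LENGTH = 5  # maximum phrase length given in step 3


def phrases(segment, max_length=MAX_LENGTH):
    """Yield contiguous phrases of 1..max_length words, syntactically filtered."""
    for i in range(len(segment)):
        for n in range(1, max_length + 1):
            gram = segment[i:i + n]
            if len(gram) < n:
                break  # ran past the end of the segment
            if gram[0] in FUNCTION_WORDS or gram[-1] in FUNCTION_WORDS:
                continue
            yield tuple(gram)


def rank(segments, min_frequency=2):
    """Score each phrase as W * F * N**2 and sort in descending order."""
    phrase_freq = Counter(p for seg in segments for p in phrases(seg))
    word_freq = Counter(w for seg in segments for w in seg)
    scored = []
    for phrase, f in phrase_freq.items():
        if f < min_frequency:
            continue  # lower-limit cut-off screens out 'unique' expressions
        distinct = set(phrase)
        w = sum(word_freq[word] for word in distinct)  # summed over distinct words
        n = len(distinct)
        scored.append((w * f * n ** 2, " ".join(phrase)))
    return sorted(scored, reverse=True)
```

Because N is squared, a three-word phrase that occurs as often as one of its component words is ranked well above it, which offsets the frequency bias toward single words.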

[Table 1 columns: Three-word phrase output; Single-word output] 

Table 1. Unedited KBB output for METAL MATRIX 
COMPOSITES 


Sample output from the Knowledge Base Building Tool 
(KBB) is shown in Table 1. The input consisted of titles and 
abstracts from records associated with the thesaurus concept 
METAL MATRIX COMPOSITES. The first column in this 
table lists the unedited three-word phrase output. Those phrases 
selected by a subject analyst for inclusion in the knowledge 
base are indicated with asterisks (*). The second column lists 
the output that the KBB program identified as being single 
words. Several acronyms and material abbreviations have 
been recognized and flagged by a subject analyst. 

Maintenance Tools 

Statistical text analysis procedures like the KBB are 
interesting in that an analysis of their output suggests many 
possible alternate applications. One problem associated with 
the maintenance of MAI knowledge bases arises from the fact 
that the conceptual domain, as represented by the controlled 
vocabulary, is not static. New terms are regularly added and 
integrated into existing conceptual hierarchies. The areas of 
the knowledge base requiring modification in response to such 
changes cannot be easily inferred, particularly if the file 
happens to be very large (the NASA file currently has over 
113,000 entries). 

A modification of the KBB program, the KBM (Knowledge 
Base Maintenance) routine, provides a tool for the 
identification of affected entries. The final procedure in the 
KBB process was altered to allow phrases already translating 
to a thesaurus term to be included in the final output and 
flagged for easy recognition. Text expressions which may 
have been mapped to an existing thesaurus term in lieu of the newly 
established term will be evident in the program output. 
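The difference between the two final procedures can be sketched as follows, assuming the knowledge base is a plain mapping from phrase to thesaurus term. The function names and the sample entries are hypothetical.

```python
def kbb_filter(phrases, knowledge_base, target):
    """KBB-style final step: drop phrases already mapped to another term."""
    return [p for p in phrases if knowledge_base.get(p, target) == target]


def kbm_report(phrases, knowledge_base, new_term):
    """KBM-style final step: keep every phrase, flagging existing mappings."""
    report = []
    for p in phrases:
        existing = knowledge_base.get(p)
        # Flag only mappings that point at some term other than the new one.
        flag = existing if existing not in (None, new_term) else None
        report.append((p, flag))
    return report


# Hypothetical entries and candidate phrases for a newly added term.
kb = {
    "ROBOT EYES": "COMPUTER VISION",
    "ROBOT VISION SYSTEMS": "ROBOT VISION",
}
candidates = ["ROBOT EYES", "ROBOT VISION SYSTEMS", "RANGE IMAGING"]
```

Here `kbm_report(candidates, kb, "ROBOT VISION")` keeps ROBOT EYES in the output and flags it with its current term, so a reviewer can decide whether the entry should now point to the new term.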


4. Other Applications and Future Directions 

The KBM program has more recently been used as an aid 
to thesaurus construction. The phrase output provides some 
guidance in identifying trends in the lexicon of the particular 
subject area; the existing thesaurus terms associated with the 
identified phrases suggest probable hierarchy locations and 
related terms for new thesaurus entries. The program is 
particularly useful when investigating an emerging technol- 
ogy or discipline. Sample phrase output for the general area of 
ROBOTICS is shown in the first column of Table 2. The 
second column presents some output phrases that have been 
selected and conceptually organized by a lexicographer. 

Since the particular text input to the KBM is identified 
through a traditional on-line database search, the conceptual 
scope of the text to be processed can be easily modified. A 
separate corpus of input text associated with the more specific 
area of ROBOT VISION was processed in conjunction with 
the analysis of general ROBOTICS. Sample output is shown 
in Table 3. Synonyms have been flagged by a 
lexicographer, and other phrases relevant to the lexicon of this 
particular discipline have been flagged with (*). 

Table 2. KBM applied to thesaurus construction: output 
from text related to a broad subject field (ROBOTICS) 

The limitation of the type of text-based MAI system 
described in this paper is that the semantic unit is restricted to 
the level of the phrase. These systems are best suited to 
document indexing applications associated with technical and 
scientific domains, where the likelihood of a phrase containing 
specific indexable content is significant. In such environments 
they provide both quality and production benefits. Some 
interesting possibilities exist for the application of these MAI 
systems to a full-text environment. The same basic design 
could be modified to capture occurrence frequencies of 
suggested thesaurus terms. A term weighting scheme could be 
developed that incorporated these statistical values and 
special weight values assigned to terms originating from key 
structural elements of the document, such as title, section 
headings, abstracts, etc. Such a system may very well provide 
a quality of output that would allow its use as a totally 
automatic indexing system. 
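A term weighting scheme of the kind suggested here might look like the sketch below. The structural weights are arbitrary placeholders chosen for illustration; the paper proposes the idea but gives no values.

```python
from collections import defaultdict

# Assumed weights for the structural element in which a term was suggested.
ELEMENT_WEIGHTS = {"title": 3.0, "heading": 2.0, "abstract": 1.5, "body": 1.0}


def weight_terms(suggestions):
    """Combine occurrence counts and structural weights into one score.

    `suggestions` is an iterable of (thesaurus_term, element) pairs, one
    pair per occurrence of a suggested term in the document.
    """
    scores = defaultdict(float)
    for term, element in suggestions:
        scores[term] += ELEMENT_WEIGHTS.get(element, 1.0)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


suggested = [
    ("ROBOT VISION", "title"),
    ("ROBOT VISION", "body"),
    ("ROBOTICS", "body"),
    ("ROBOTICS", "body"),
]
```

With these placeholder weights, a term suggested once in the title and once in the body outranks a term suggested twice in the body only.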

Table 3. KBM applied to thesaurus construction: unedited 
output from text related to a narrow subject field (ROBOT 
VISION) 